• PROJECT OBJECTIVE: To demonstrate the ability to fetch, process, and leverage data to generate useful predictions by training supervised learning algorithms.

  1. Data Understanding:

a. Read all the 3 CSV files as DataFrame and store them into 3 separate variables.
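A minimal sketch of this step. The actual filenames are not given in the report, so tiny in-memory CSVs (via `io.StringIO`) stand in for the three files here; in the project, three `pd.read_csv("<file>.csv")` calls would be used instead.

```python
import io
import pandas as pd

# In the project, the three CSVs are read from disk, e.g.
# df_a = pd.read_csv("patients_a.csv")   # hypothetical filename
# Tiny in-memory CSVs stand in for the three files in this sketch.
csv_a = io.StringIO("P_incidence,Class\n63.0,Hernia\n39.1,Hernia\n")
csv_b = io.StringIO("P_incidence,Class\n38.5,Normal\n54.9,Normal\n")
csv_c = io.StringIO("P_incidence,Class\n74.4,Spondylolisthesis\n")

df_a, df_b, df_c = (pd.read_csv(f) for f in (csv_a, csv_b, csv_c))

# Shape and columns of all three DataFrames (sub-item b).
for name, df in [("A", df_a), ("B", df_b), ("C", df_c)]:
    print(name, df.shape, list(df.columns))
```

Printing `df.dtypes` for each frame covers sub-item d in the same loop.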

b. Print Shape and columns of all the 3 DataFrames.

c. Compare Column names of all the 3 DataFrames and clearly write observations.

d. Print DataTypes of all the 3 DataFrames.

Every DataFrame consists of the following data types:

e. Observe and share variation in ‘Class’ feature of all the 3 DataFrames.

Though all data points in a given DataFrame correspond to a single class of patients, the ‘Class’ attribute contains spelling and case variations of the same class name, viz.,

Any classification algorithm would misinterpret these as distinct classes, so they need to be corrected.

  2. Data Preparation and Exploration:

a. Unify all the variations in ‘Class’ feature for all the 3 DataFrames.

Based on the attributes, and cross-referencing public datasets, this data belongs to a study of orthopaedic patients compiled by Dr. Henrique da Mota.

Accordingly, let us correct the class information as follows:
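The unification step can be sketched as below. The variant spellings shown are illustrative stand-ins for the ones observed in the raw files; lower-casing first, then mapping to a canonical name, collapses all variants.

```python
import pandas as pd

# Hypothetical spelling/case variants of the same class labels.
df = pd.DataFrame({"Class": ["Hernia", "HERNIA", "hernia", "Normal",
                             "NORMAL", "Spondylolisthesis",
                             "spondylolisthesis"]})

# Normalise case, then map each variant to one canonical class name.
canonical = {"hernia": "Disk Hernia",
             "normal": "Normal",
             "spondylolisthesis": "Spondylolisthesis"}
df["Class"] = df["Class"].str.lower().map(canonical)

print(df["Class"].unique())   # three clean class names remain
```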

b. Combine all the 3 DataFrames to form a single DataFrame.
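A sketch of the combine step with small stand-in frames (assuming, as in the project, that all three share the same schema after the Class fix); `pd.concat` with `ignore_index=True` stacks them and rebuilds a clean index.

```python
import pandas as pd

# Small stand-ins for the three per-class DataFrames.
df_a = pd.DataFrame({"P_incidence": [63.0, 39.1],
                     "Class": ["Disk Hernia"] * 2})
df_b = pd.DataFrame({"P_incidence": [38.5], "Class": ["Normal"]})
df_c = pd.DataFrame({"P_incidence": [74.4],
                     "Class": ["Spondylolisthesis"]})

# ignore_index=True rebuilds a clean 0..n-1 index for the combined frame.
ortho = pd.concat([df_a, df_b, df_c], ignore_index=True)
print(ortho.shape)   # (4, 2)
```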

c. Print 5 random samples of this DataFrame.

d. Print Feature-wise percentage of Null values.

Each column contains only valid data, as ortho.info() also confirms with 310 non-null entries in every column.
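The feature-wise null percentage is one line in pandas; sketched here on a stand-in frame with one missing value so the computation is visible.

```python
import numpy as np
import pandas as pd

# Stand-in frame with a single missing value to show the computation.
df = pd.DataFrame({"P_incidence": [63.0, 39.1, np.nan, 54.9],
                   "S_slope": [40.5, 25.0, 29.7, 33.1]})

# Mean of the boolean null mask, scaled to a percentage per column.
null_pct = df.isna().mean() * 100
print(null_pct)
```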

e. Check 5-point summary of the new DataFrame.

The feature medians span a wide range (11 to 120), and the attributes sit on different scales and ranges, so appropriate preprocessing (scaling) is necessary before modelling.

  3. Data Analysis:

a. Visualize a heatmap to understand correlation between all features.
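A sketch of the correlation computation on synthetic stand-in features (the real study uses the merged orthopaedic frame); `S_slope` is constructed here to correlate with `P_incidence`, and the heatmap itself is one seaborn call on the resulting matrix.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in features; S_slope is built to correlate with
# P_incidence, P_radius is independent noise.
p_incidence = rng.normal(60, 17, 300)
df = pd.DataFrame({
    "P_incidence": p_incidence,
    "S_slope": 0.8 * p_incidence + rng.normal(0, 8, 300),
    "P_radius": rng.normal(118, 13, 300),
})

corr = df.corr()
print(corr.round(2))
# With seaborn installed, the heatmap itself is one call:
# import seaborn as sns; sns.heatmap(corr, annot=True, cmap="coolwarm")
```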

b. Share insights on correlation.

i. Features having stronger correlation with correlation value.

ii. Features having weaker correlation with correlation value.

The dataset shows low multicollinearity (the attributes are not heavily correlated), which makes it easier for regression-style models to explain each feature's influence on the outcome.

Let us list the top 2 and bottom 2 pairs:

There are no negative correlations as strong as the positive pairs above.

The feature pairs with the weakest correlation are:

Though their correlation magnitudes are low, these are the most strongly negative values, hence the above two qualify as the least correlated pairs.

c. Visualize a pairplot with 3 classes distinguished by colors and share insights.

From the pairplot and KDE distribution plots, we can infer the following notable feature-pair relationships:

The colour grouping helps identify the following:

The classes appear likely to be linearly separable in higher dimensions.

d. Visualize a jointplot for ‘P_incidence’ and ‘S_slope’ and share insights.

As was seen earlier in the correlation heatmap and pairplot visualisations, these two features show a strong positive correlation.

e. Visualize a boxplot to check distribution of the features and share insights.

Almost every feature has values centered around its median with a few outliers, except S_Degree, which is heavily right-skewed with one extreme outlier.

The same is reflected in the coefficient of variation of 1.42 for S_Degree.

Interestingly, P_radius seems to have outliers on both sides.

  4. Model Building:

a. Split data into X and Y.

b. Split data into train and test with 80:20 proportion.

c. Train a Supervised Learning Classification base model using KNN classifier.
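The X/Y split, 80:20 train/test split, base KNN model, and classification metrics can be sketched together as below. The Iris dataset stands in for the orthopaedic data here, since the real frame is not reproduced in this report.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Iris stands in for the orthopaedic data: split into X and y,
# then 80:20 train/test with stratification on the class labels.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y)

knn = KNeighborsClassifier(n_neighbors=5)   # base model, default k
knn.fit(X_train, y_train)

print("train accuracy:", knn.score(X_train, y_train))
print("test accuracy:", knn.score(X_test, y_test))
# Per-class precision/recall/F1 — recall being the key medical metric.
print(classification_report(y_test, knn.predict(X_test)))
```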

d. Print all the possible classification metrics for both train and test data.

Since this is a medical use case, recall is a critical evaluation metric.

Recall values of 0.750 for Disk Hernia and 0.853 for Spondylolisthesis need improvement.

  5. Performance Improvement:

a. Tune the parameters/hyperparameters to improve the performance of the base model.

Accuracy: 0.823 --> 0.839

Spondylolisthesis recall: 0.853 --> 0.966

While both of the above have improved, Disk Hernia recall has dropped: 0.750 --> 0.538.

Given that accuracy improves while recall drops, let us not decide based on this single record deletion alone; after studying the other options, we will revisit outlier deletion at the end for the final decision.

Standardisation has improved accuracy, and as it is a standard measure for this kind of analysis, we will continue using standardised data.

The Disk Hernia class has a very low proportion of samples; after balancing, the Normal and Disk Hernia classes are clearly upsampled.

Data balancing has helped improve accuracy further, along with recall for Disk Hernia.

Let us try ADASYN balancing to see whether it yields better results.

ADASYN-based balancing caused a further loss of accuracy and recall scores, hence we will stick with SMOTE going forward for this use case.

Hyperparameter tuning shows results similar to the untuned model (compared against the SMOTE-balanced model); in this use case our initial hyperparameters happened to be the best.

Let us see if we can improve the results further.

Polynomial features have failed to improve results, so we drop that idea.

As mentioned earlier, let us try outlier deletion with the tuned model; once again, outlier deletion has not helped much.

Based on the above studies, we fix our final choice as the standardised, SMOTE-upsampled, hyperparameter-tuned model.
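The chosen pipeline (scaling plus hyperparameter-tuned KNN) can be sketched as below. The breast-cancer dataset stands in for the orthopaedic frame, the parameter grid is illustrative, and the SMOTE step is noted in a comment because it needs the separate imbalanced-learn package.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in dataset; the real study uses the merged orthopaedic frame.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y)

# SMOTE would slot in between the scaler and the classifier using
# imblearn.pipeline.Pipeline and imblearn.over_sampling.SMOTE
# (requires the imbalanced-learn package); omitted in this sketch.
pipe = Pipeline([("scale", StandardScaler()),
                 ("knn", KNeighborsClassifier())])

# Grid-search the KNN hyperparameters, optimising recall as the
# critical medical metric.
grid = GridSearchCV(pipe,
                    param_grid={"knn__n_neighbors": [3, 5, 7, 9, 11],
                                "knn__weights": ["uniform", "distance"]},
                    scoring="recall", cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, round(grid.score(X_test, y_test), 3))
```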

b. Clearly showcase improvement in performance achieved.

c. Clearly state which parameters contributed most to improve model performance.

What could be the probable reason?

As mentioned towards the end of the study: standardisation and upsampling remove the implicit weighting that feature scales and class imbalance impose on features and records, hence the noticeable improvement in results.

Part Two

Project Objective:

Build a Machine Learning model to perform focused marketing by predicting the potential customers who will convert using the historical dataset.

  1. Data Understanding and Preparation:

a. Read both the datasets ‘Data1’ and ‘Data2’ as DataFrames and store them into two separate variables.

b. Print shape and Column Names and DataTypes of both the Dataframes.

Data1 and Data2 are to be considered together as a single dataset, merged on the ‘ID’ column, since each carries different attributes.

Features like ZipCode are stored as a numeric type and should be changed to object (categorical) type; others are to be studied further.

Apparently there is no output feature; let us study further.

c. Merge both the DataFrames on ‘ID’ feature to form a single DataFrame.
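A sketch of the merge with small stand-in frames (the real Data1/Data2 columns are not reproduced here); an inner merge on `ID` joins each customer's attributes from both files into one row.

```python
import pandas as pd

# Small stand-ins for Data1 and Data2; each carries different
# attributes for the same customers, keyed by ID.
data1 = pd.DataFrame({"ID": [1, 2, 3], "Age": [25, 41, 33]})
data2 = pd.DataFrame({"ID": [1, 2, 3], "LoanOnCard": [0.0, 1.0, 0.0]})

bank = pd.merge(data1, data2, on="ID", how="inner")
print(bank.columns.tolist())   # ['ID', 'Age', 'LoanOnCard']
```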

d. Change datatype of the below features to ‘Object’: ‘CreditCard’, ‘InternetBanking’, ‘FixedDepositAccount’, ‘Security’, ‘Level’, ‘HiddenScore’.
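The datatype change is a single `astype` call over the listed columns, sketched here on a stand-in frame with illustrative values.

```python
import pandas as pd

# Stand-in with the flag-like columns stored as numbers.
bank = pd.DataFrame({"CreditCard": [0, 1], "InternetBanking": [1, 1],
                     "FixedDepositAccount": [0, 0], "Security": [1, 0],
                     "Level": [1, 3], "HiddenScore": [2, 4]})

cols = ["CreditCard", "InternetBanking", "FixedDepositAccount",
        "Security", "Level", "HiddenScore"]
# Convert the numerically-stored categorical flags to object dtype.
bank[cols] = bank[cols].astype("object")
print(bank.dtypes)
```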

  2. Data Exploration and Analysis:

a. Visualize distribution of Target variable ‘LoanOnCard’ and clearly share insights.

Less than 11% of customers have a loan on their credit card.

While the target variable is heavily imbalanced, there is still scope to identify potential borrowers.

b. Check the percentage of missing values and impute if required.

Since ‘LoanOnCard’ is the target variable and only 0.4% of the dataset has missing values in it, let us drop those records.
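Dropping the rows with a missing target is one `dropna` call, sketched here on a stand-in frame (the real data has ~0.4% of rows affected).

```python
import numpy as np
import pandas as pd

# Stand-in: a few rows with the target missing.
bank = pd.DataFrame({"ID": range(6),
                     "LoanOnCard": [0.0, 1.0, np.nan, 0.0, np.nan, 1.0]})

before = len(bank)
# Drop only the rows where the target itself is missing.
bank = bank.dropna(subset=["LoanOnCard"])
print(before, "->", len(bank))   # 6 -> 4
```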

c. Check for unexpected values in each categorical variable and impute with best suitable value.

There are no unexpected values in the categorical (object-type) columns; all values are numeric, from 0 to 4 only.

Note: since these features have more than 2 categories, be sure to create dummies for the Level and HiddenScore fields.

Let us also check the numerical columns for unexpected values.

Given that none of the unexpected records has LoanOnCard set, that a customer's relationship with the bank cannot be negative, and that the data is already heavily imbalanced, let us drop these records.

  3. Data Preparation and Model Building:

a. Split data into X and Y.

The features are clearly on different scales, so it is advisable to standardise the data.

b. Split data into train and test. Keep 25% data reserved for testing.

d. Print evaluation metrics for the model and clearly share insights.

e. Balance the data using the right balancing technique.

f. Train the same model again on the balanced data.

g. Print evaluation metrics and clearly share differences observed.

  4. Performance Improvement:

a. Train a base model each for SVM and KNN.

Accuracy improved after tuning, from 88.9% with Logistic Regression to 93.6% with both KNN and SVM.
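The two base models above can be sketched as scaled pipelines. The breast-cancer dataset stands in for the bank frame here, with the report's 25% test split; exact scores will differ from the figures quoted above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data; in the project this is the standardised bank frame.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y)

# Both base models need scaled inputs, so bundle the scaler in.
svm = make_pipeline(StandardScaler(), SVC())
knn = make_pipeline(StandardScaler(), KNeighborsClassifier())

for name, model in [("SVM", svm), ("KNN", knn)]:
    model.fit(X_train, y_train)
    print(name, round(model.score(X_test, y_test), 3))
```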